Introduction to Open Data Science

About the project

In this course, the aim is to practice coding, recoding and programming, and to analyze and visualize open data.

Link to my github repository:

https://github.com/hhautakangas/IODS-project


Data wrangling R code

## Heidi Hautakangas, 29.01.2017
## Week 2: Regression and model validation

# DATA WRANGLING

# 2. Read data 

LEARN <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", as.is=T, header=T)
dim(LEARN)
str(LEARN)

attach(LEARN)
# Data contains 183 observations and 60 variables. 59 variables are integers and one variable is character.

# 3. Subsetting data gender, age, attitude, deep, stra, surf and points by combining questions in the learning2014 data
# check names
names(LEARN)

# create new variable Deep (back to original scale divide by 12)
d_sm <- D03+D11+D19+D27
d_ri <- D07+D14+D22+D30
d_ue <- D06+D15+D23+D31

LEARN$Deep <- (d_sm+d_ri+d_ue)/12

# create new variable Surf (back to original scale divide by 12)
su_lp <- SU02+SU10+SU18+SU26
su_um <- SU05+SU13+SU21+SU29
su_sb <- SU08+SU16+SU24+SU32

LEARN$Surf <- (su_lp+su_um+su_sb)/12

# create new variable Stra (back to original scale divide by 8)
st_os <- ST01+ST09+ST17+ST25
st_tm <- ST04+ST12+ST20+ST28

LEARN$Stra <- (st_os+st_tm)/8

# create subset that contains following variables: gender, Age, Attitude, Points, Deep, Stra, Surf
# and exclude observations where points is zero
LEARNsub <- subset(LEARN, Points != 0, select=c(gender, Age, Attitude, Points, Deep, Stra, Surf))

# check subset dimension, should be 166 obs and 7 variables
dim(LEARNsub)

# 4. Set the IODS folder as the working directory
setwd("~/Documents/IODS-project")

# write subset as a new file in the data folder
write.table(LEARNsub, "~/Documents/IODS-project/data/learning2014sub.txt", quote=FALSE, row.names = FALSE, col.names = TRUE)

# read data again
LEARNsub2 <- read.table("data/learning2014sub.txt", as.is=TRUE, header=TRUE)

# check that the structure is ok
str(LEARNsub2)
head(LEARNsub2)

WEEK 2

ANALYSIS

Dataset LEARN includes 166 observations and 7 variables: gender, Age, Attitude, Points, Deep, Stra and Surf. Gender is a character string coded as F = female and M = male; Age, Attitude and Points are integers, and Deep, Stra and Surf are numeric. Age is reported in years, derived from the date of birth; the mean age is 25.5 years and the range is from 17 to 55 years. Attitude measures global attitude towards statistics, with a maximum of 50 points and a minimum of 14, the mean score being 31.43 and the median 32. The Points variable contains total exam points; the mean is 22.72 and the points vary between 7 and 33. The variables Deep, Surf and Stra are sum variables that describe the approach to learning: deep, surface or strategic. Their scale runs from 1 to 5.

# read data
 
LEARN <- read.table("data/learning2014sub.txt", as.is=TRUE, header=TRUE)

# attach the data so there is no need to refer to the dataset used
attach(LEARN)
# structure and dimension of the data
dim(LEARN)
## [1] 166   7
str(LEARN)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ Deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Surf    : num  2.58 3.17 2.25 2.25 2.83 ...
# summary of the data
summary(LEARN)
##     gender               Age           Attitude         Points     
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##  Mode  :character   Median :22.00   Median :32.00   Median :23.00  
##                     Mean   :25.51   Mean   :31.43   Mean   :22.72  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##                     Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       Deep            Stra            Surf      
##  Min.   :1.583   Min.   :1.250   Min.   :1.583  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.667   Median :3.188   Median :2.833  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333

Plot 1 contains boxplots, distributions and scatter plots between all variables, classified by gender. Gender is not distributed evenly: there are almost twice as many female observations (n = 110) as male observations (n = 56). Most of the scatter plots show only negligible correlation between variables. Exceptions are Attitude and Points, with a correlation of 0.437, and Deep and Surf, with a negative correlation of -0.324. From both the density and the box plots it can be seen that the answers of the male and female groups differ. This can also be observed from the correlations: for example, between Deep and Surf, males have a strong negative correlation (-0.622), whereas the correlation in females is negligible (-0.087). All distributions of the female observations have more than one peak, so it can be assumed that they are not normally distributed. For males, too, the distributions of Attitude and Points clearly have several peaks. Age is positively skewed in both genders.

## plot with title added
p <- ggpairs(LEARN, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
LEARNp <- p + ggtitle("Plot 1. Learning 2014 data summary")
LEARNp

Linear regression

First, to model factors that could explain success in the exam, I fitted a regression model with three explanatory variables, Age, gender and Attitude, with Points as the dependent variable (lm1 below). Of the explanatory variables, only Attitude (p = 8.34e-09) and the regression intercept (p = 2.48e-08) were statistically significant. A one-point gain in the global attitude towards statistics increases exam points by 0.36. Because there is no statistically significant relationship between Points and Age, or between Points and gender, I removed them from the model and fitted a new one with Attitude as the only explanatory variable (lm2). The coefficient of Attitude remains almost the same, 0.35 (p = 4.12e-09), a one-point gain in attitude increasing exam points by 0.35. The scatter plot of Attitude against Points is presented in Plot 2.

# linear regression
# dependent variable Points
# explanatory variables Age, gender and Attitude

lm1 <- lm(Points ~ Age + gender + Attitude, data=LEARN)
print(summary(lm1))
## 
## Call:
## lm(formula = Points ~ Age + gender + Attitude, data = LEARN)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## Attitude     0.36066    0.05932   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08
# second model Attitude as an explanatory variable
lm2 <- lm(Points ~ Attitude, data=LEARN)
print(summary(lm2))
## 
## Call:
## lm(formula = Points ~ Attitude, data = LEARN)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# scatter plot between Attitude and Points
p2 <- qplot(Attitude, Points, data = LEARN) + geom_smooth(method = "lm")
p3 <- p2 + ggtitle("Plot 2. Student's attitude vs. exam points")
p3

The t-test assesses the effect of a given explanatory variable on the dependent variable by testing whether its coefficient differs from zero statistically significantly; the resulting t-value is used to decide whether this null hypothesis can be rejected. In regression analysis, the F-test tests whether the explanatory variables jointly explain variation in the dependent variable. For both models, the F-statistic was statistically significant: in the first model the p-value was 5.54e-08, and in the second model 4.1e-09. R^2 is the coefficient of determination, which describes the proportion of the variation of the dependent variable that the explanatory variables in the regression model can explain. It varies between 0 and 1; a small R^2 indicates that the explanatory variables explain only a small amount of the variation of the dependent variable, and a high R^2 the opposite. In the first model R^2 was 0.20, whereas in the second model it was 0.19. Adjusted R^2 also takes the number of explanatory variables into account and is used to compare the results of different regression models. Adjusted R^2 was 0.187 in the first model and 0.186 in the second model, which has only one explanatory variable.

The residual standard error describes the standard deviation of the residuals and affects the power of the test: the higher the deviation, and thus the residual standard error, the smaller the power. Both models have fairly large residual standard errors, approximately 5.32.
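As a sketch of where these numbers come from, both R^2 and the residual standard error can be computed directly from the residuals of a fitted model. The built-in cars data is used here only so the snippet is self-contained; the same recipe applies to lm1 and lm2 above.

```r
# fit a simple model on a built-in dataset for illustration
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2  <- 1 - rss / tss                          # coefficient of determination
rse <- sqrt(rss / df.residual(fit))           # residual standard error
all.equal(r2,  summary(fit)$r.squared)        # TRUE
all.equal(rse, summary(fit)$sigma)            # TRUE
```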

Linear regression assumes that the relationship between the variables is linear and that the errors are normally distributed and uncorrelated. Homoskedasticity of the errors is also assumed, meaning that their variance is constant and does not depend on the observation. Heteroskedasticity can, however, be handled, for example by weighting observations by the inverse of the error variance. Leverage measures the amount of impact a single observation has on the model, and a residuals vs. leverage plot can be used to identify observations with unusually high impact.
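The inverse-variance weighting mentioned above can be done directly with the weights argument of lm. A minimal sketch with simulated data (the names x, y and w are illustrative, and the weights assume the error variance is known):

```r
set.seed(1)
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100, sd = 0.1 * x)  # error sd grows with x: heteroskedastic
w <- 1 / (0.1 * x)^2                         # weights = inverse of the error variance
wls <- lm(y ~ x, weights = w)                # weighted least squares fit
coef(wls)                                    # close to the true values 2 and 0.5
```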

For model validation, I examined these assumptions by producing three plots from the second model: Residuals vs. Fitted, a Normal Q-Q plot and Residuals vs. Leverage. Plot 3b shows that the normality assumption of the errors holds. Plot 3a is used to examine homoskedasticity: the observations are scattered and do not show any systematic pattern, so the homoskedasticity assumption is also valid. Plot 3c shows that the leverage is regular and no observation has unusually high impact. The model is thus valid.

## Plots for validation
par(mfrow= c(2,2))
plot(lm2, which=c(1), main="Plot 3a. Model validation")
plot(lm2, which=c(2), main="Plot 3b. Model validation")
plot(lm2, which=c(5), main="Plot 3c. Model validation")


WEEK 3

ANALYSIS

Data

The ALC data includes 35 variables with 382 observations. There is some background information, such as sex, age, address, family size and the parents' job and education. The main interest is in study habits, grades and the factors that might affect study success; habits of alcohol use are also examined. There are three types of variables: 1. binary variables, for example sex, relationship status and extra-curricular activities, 2. numeric variables, for example age and number of absences, and 3. nominal variables, for example the parents' job. The age interval is from 15 to 22 and the median age is 17.

# names and summary
names(ALC)
dim(ALC)
str(ALC)
summary(ALC)

Hypotheses

To study factors that might be related to high/low alcohol consumption, I chose the following four variables: sex, absences, romantic and studytime. Sex and romantic are binary variables, studytime and absences numeric. The hypotheses about their relationship with alcohol consumption are:

Sex:

H0: Sex does not affect alcohol consumption

H1: Alcohol consumption differs between males and females

Romantic:

H0: Being in a romantic relationship does not affect alcohol consumption

H1: Being in a romantic relationship lessens the alcohol consumption

Absences:

H0: There is no relationship between alcohol consumption and number of absences

H1: Those with higher alcohol consumption have more absences

Studytime:

H0: There is no relationship between studytime and alcohol consumption

H1: Those who study more use less alcohol

To start, these relationships are examined with graphics and tables.

# tables and plots
# table 1
ALC %>% group_by(high_use, sex) %>% summarise(count = n())
## Source: local data frame [4 x 3]
## Groups: high_use [?]
## 
##   high_use   sex count
##      <lgl> <chr> <int>
## 1    FALSE     F   156
## 2    FALSE     M   112
## 3     TRUE     F    42
## 4     TRUE     M    72
## barplot SEX
plot(factor(high_use)~factor(sex), xlab="Sex", ylab="high use of alcohol", main="Plot 1. Sex of the students")

Table 1 shows that in both sexes low alcohol consumption is more common, but the proportion of high use is larger in males (72/184) than in females (42/198). This can also be seen in Plot 1. These findings support the alternative hypothesis that there is a difference between males and females.

# table 2
ALC %>% group_by(high_use, romantic) %>% summarise(count = n()) 
## Source: local data frame [4 x 3]
## Groups: high_use [?]
## 
##   high_use romantic count
##      <lgl>    <chr> <int>
## 1    FALSE       no   180
## 2    FALSE      yes    88
## 3     TRUE       no    81
## 4     TRUE      yes    33
# barplot ROMANTIC
plot(factor(high_use)~factor(romantic),xlab="in a relationship", ylab="high use of alcohol", main= "Plot 2.  Relationship and high alcohol use")

From Plot 2 it can be seen that there is no clear difference in high alcohol consumption by relationship status. From the same plot and from the table it can also be seen that fewer students are in a romantic relationship than not (121 vs. 261). These findings support the null hypothesis.

# table 3
ALC %>% group_by(high_use) %>% summarise(count = n(), mean_absences=mean(absences))
## # A tibble: 2 × 3
##   high_use count mean_absences
##      <lgl> <int>         <dbl>
## 1    FALSE   268      3.705224
## 2     TRUE   114      6.368421
# boxplot ABSENCES
g2 <- ggplot(ALC, aes(y =absences, x=high_use))
g2 + geom_boxplot() + ggtitle("Plot 3. Number of absences")

Table 3 shows that the mean number of absences differs between high and low users: those with high alcohol consumption have almost twice as many absences as those with low consumption. This can also be observed in Plot 3. These findings support the alternative hypothesis that high users have more absences.

#table 4
ALC %>% group_by(high_use) %>% summarise(count = n(), mean_weekly_study_time=mean(studytime))
## # A tibble: 2 × 3
##   high_use count mean_weekly_study_time
##      <lgl> <int>                  <dbl>
## 1    FALSE   268               2.149254
## 2     TRUE   114               1.771930
# barplot STUDYTIME 
spineplot(factor(studytime)~factor(high_use), xlab="high use of alcohol", ylab="weekly study time", main="Plot 4. Weekly study time")

Table 4 shows that those with low alcohol consumption spend on average more hours studying per week than those with high consumption. The difference can be seen in Plot 4. This suggests that the null hypothesis could possibly be rejected. To confirm these findings, they need to be tested statistically.

Logistic regression and odds ratios

Because the response variable high_use is binary, these relationships are best examined with logistic regression. From the summary of the fitted model (below), it can be seen that all variables except romantic are statistically significant at p < 0.05. If a more stringent threshold of p < 0.01 is used, absences and sex remain statistically significant, but studytime no longer reaches the required level. Studytime has a negative coefficient, while absences and sex have positive ones. Deviance is used to examine the goodness of fit of the model: when the model fits the data perfectly, the deviance is zero, and in a binomial model there is overdispersion if the deviance is much higher than the degrees of freedom. The null deviance is 465.68 on 381 degrees of freedom, which suggests some overdispersion in the model. The odds ratio for absences is 1.09 with a 95% confidence interval of (1.05, 1.15), meaning that the odds of high alcohol consumption are 1.09 times higher for each additional absence. Males have 2.2 times higher odds of high alcohol consumption than females, with a 95% confidence interval of (1.35, 3.62). Weekly studytime works in the opposite direction: those who study more have 0.67 times the odds of high alcohol consumption, with a 95% confidence interval of (0.49, 0.91). These results support the earlier observations from the plots and tables.
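The drop from the null deviance to the residual deviance can itself be tested with a chi-squared likelihood-ratio test. A sketch using the deviances and degrees of freedom reported in the model summary below:

```r
# deviances as reported by summary(m1)
null_dev <- 465.68; null_df <- 381
res_dev  <- 422.66; res_df  <- 377
lr_stat  <- null_dev - res_dev                # likelihood-ratio statistic
p_value  <- pchisq(lr_stat, df = null_df - res_df, lower.tail = FALSE)
p_value                                       # far below 0.05: the model improves on the null
```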

# logistic regression model with all variables
m1 <- glm(high_use ~ absences + studytime + romantic + sex, data = ALC, family = "binomial")
#  a summary of the model
summary(m1)
## 
## Call:
## glm(formula = high_use ~ absences + studytime + romantic + sex, 
##     family = "binomial", data = ALC)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1456  -0.8356  -0.6145   1.0895   2.0865  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.87199    0.41840  -2.084  0.03715 *  
## absences     0.09026    0.02314   3.900 9.61e-05 ***
## studytime   -0.39477    0.15916  -2.480  0.01312 *  
## romanticyes -0.20370    0.26174  -0.778  0.43643    
## sexM         0.78878    0.25124   3.140  0.00169 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 422.66  on 377  degrees of freedom
## AIC: 432.66
## 
## Number of Fisher Scoring iterations: 4
# coefficients of the model
coef(m1)
## (Intercept)    absences   studytime romanticyes        sexM 
## -0.87198682  0.09025851 -0.39477449 -0.20369623  0.78877844
# compute odds ratios (OR)
OR <- coef(m1) %>% exp
# compute confidence intervals (CI)
CI <-confint(m1) %>% exp
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.4181200 0.1829448 0.9470310
## absences    1.0944572 1.0481820 1.1479059
## studytime   0.6738320 0.4886211 0.9135379
## romanticyes 0.8157101 0.4840025 1.3541213
## sexM        2.2007065 1.3501218 3.6224923

Predictive power, training error and cross-validation

To compute the predictive power of the model, the statistically non-significant variable romantic is excluded from it.

# model with only statistically significant variables
m2 <- glm(high_use ~ absences + studytime  + sex, data = ALC, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ absences + studytime + sex, family = "binomial", 
##     data = ALC)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1771  -0.8487  -0.5971   1.0986   2.1157  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.91782    0.41459  -2.214  0.02684 *  
## absences     0.08876    0.02308   3.846  0.00012 ***
## studytime   -0.40248    0.15930  -2.526  0.01152 *  
## sexM         0.79826    0.25065   3.185  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.27  on 378  degrees of freedom
## AIC: 431.27
## 
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use and add it in the data
probabilities <- predict(m2, type = "response")
ALC <- mutate(ALC, probability = probabilities)

# use the probabilities to make a prediction of high_use
ALC <- mutate(ALC, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = ALC$high_use, prediction = ALC$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     88   26
# tabulate the target variable versus the predictions
table(high_use = ALC$high_use, prediction = ALC$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.02617801 0.70157068
##    TRUE  0.23036649 0.06806283 0.29842932
##    Sum   0.90575916 0.09424084 1.00000000
## training error
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = ALC$high_use, prob = ALC$probability)
## [1] 0.2565445
# 10-fold cross-validation 
library(boot)
cv <- cv.glm(data = ALC, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2591623

There was a 0.67 probability (258/382) of both the prediction and the observation being false, and a 0.07 probability (26/382) of both being true. The probability of a false positive was 0.026 (10/382) and that of a false negative 0.23 (88/382). The probabilities are presented in Plot 5. The training error of the model is 0.26, and by 10-fold cross-validation the test error is approximately 0.26, which happens to be the same as in the DataCamp model.
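The training error can equivalently be read off the confusion matrix as the share of off-diagonal cells; a sketch using the counts from the prediction table above:

```r
# counts copied from the high_use vs. prediction table above
conf <- matrix(c(258, 88, 10, 26), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
# training error = share of misclassified (off-diagonal) observations
training_error <- (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
training_error   # 0.2565..., matching the loss function output above
```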

# Plot 5. Predictions
g <- ggplot(ALC, aes(x = probability, y = high_use, col= prediction))
g + geom_point()+ggtitle("Plot 5. Prediction vs true values probabilites")


WEEK 4

Analysis

Boston Data

The Boston data set, about housing values in the suburbs of Boston, has 14 variables and 506 observations, of which 12 are continuous numerical variables and 2 are discrete integers. It includes several variables that might influence housing values, listed below.

-crim: per capita crime rate by town.

-zn: proportion of residential land zoned for lots over 25,000 sq.ft.

-indus: proportion of non-retail business acres per town.

-chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

-nox: nitrogen oxides concentration (parts per 10 million).

-rm: average number of rooms per dwelling.

-age: proportion of owner-occupied units built prior to 1940.

-dis: weighted mean of distances to five Boston employment centres.

-rad: index of accessibility to radial highways.

-tax: full-value property-tax rate per $10,000.

-ptratio: pupil-teacher ratio by town.

-black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

-lstat: lower status of the population (percent).

-medv: median value of owner-occupied homes in $1000s.

# load the data from MASS package
data("Boston")
## check what the data looks like (structure and dimensions etc.)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

To explore the variables and their relationships further, I created plots of their distributions and correlations.

# Plots 
p <- ggpairs(Boston, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
p1 <- p + ggtitle("Housing values in suburbs of Boston")
p1

Figure 1. Boston data variables

# correlation plot
cor_matrix<-cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d",tl.cex=0.6)

Figure 2. Correlations

From the distribution plots in Figure 1, it can be seen that most variables do not follow a normal distribution. Only one variable, the average number of rooms per dwelling (rm), seems normally distributed. The other variables are either right-skewed (crim, zn, chas, nox, dis, lstat and medv), left-skewed (age, ptratio and black) or clearly bimodal (indus, rad and tax). There also seems to be strong correlation between many of the variables. The correlations are plotted in Figure 2, where red indicates negative and blue positive correlation, and the size and colour intensity show how strong the correlation is: darker and bigger circles indicate stronger correlation than small and light ones. For example, there is a strong negative correlation between the lower status of the population (lstat) and the median value of owner-occupied homes in $1000s (medv), whereas there is a strong positive correlation between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax). Whether these observations are statistically significant can be confirmed by statistical tests.
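As a sketch of such a test, cor.test reports the correlation estimate together with a p-value; here for the lstat–medv pair (the Boston data comes from the MASS package):

```r
library(MASS)          # provides the Boston data
data("Boston")
ct <- cor.test(Boston$lstat, Boston$medv)
round(ct$estimate, 2)  # about -0.74
ct$p.value < 0.05      # TRUE: the negative correlation is statistically significant
```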

## summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The skewness of the variables can also be observed in the summary statistics (above). In crim, the interval between the minimum 0.00632 and the maximum 88.98 is wide, but the median is only 0.257. The results are difficult to interpret because of the differing scales, so the data should be standardized (below). After standardization the variables are comparable with each other and easier to interpret.
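Standardization here means subtracting the column mean and dividing by the column standard deviation, which is exactly what scale() does; a quick sketch verifying this on one column:

```r
library(MASS)          # provides the Boston data
data("Boston")
# manual standardization of one column
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
# the same via scale()
auto <- as.numeric(scale(Boston$crim))
all.equal(manual, auto)   # TRUE
```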

# standardize data and print out the summary
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#names(boston_scaled)

Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) is a classification method that can be used to find the variables that best separate classes, to predict the classes of new data and for dimension reduction. The target variable can be either binary or multiclass. LDA assumes that the variables follow a normal distribution and that each class shares the same covariance matrix. To perform LDA with the target variable crime, the data is divided into train and test sets.

# extract the scaled crim variable
scaled_crim <- boston_scaled$crim

# summary of the scaled_crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)


# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Divide dataset to test and train sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
#  train set
train <- boston_scaled[ind,]
# and test set 
test <- boston_scaled[-ind,]

# linear discriminant analysis on the train set 
lda.fit <- lda(crime ~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2673267 0.2450495 0.2376238 0.2500000 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.9382261 -0.8957833 -0.16296517 -0.8542826  0.4340845
## med_low  -0.1258162 -0.2489654 -0.03371693 -0.5326769 -0.1361260
## med_high -0.3603953  0.0649858  0.17879700  0.3002694  0.2553563
## high     -0.4872402  1.0171306 -0.03844192  1.0680261 -0.4044577
##                 age        dis        rad        tax     ptratio
## low      -0.8761256  0.8789152 -0.6915643 -0.7435780 -0.48071361
## med_low  -0.3701875  0.2761187 -0.5596066 -0.4417981 -0.06764267
## med_high  0.3757218 -0.3124045 -0.3717481 -0.3129112 -0.23447099
## high      0.8291263 -0.8515309  1.6379981  1.5139626  0.78062517
##                black       lstat        medv
## low       0.37847800 -0.77613322  0.51390665
## med_low   0.31131016 -0.12312243 -0.01290242
## med_high  0.08527919 -0.06652266  0.27063613
## high     -0.71288614  0.84959859 -0.62871000
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.08409503  0.791470926 -0.76119958
## indus   -0.03667348 -0.204847917  0.28336801
## chas    -0.09165166 -0.070910023  0.06442484
## nox      0.46762084 -0.682204091 -1.39088577
## rm      -0.10022897 -0.177900652 -0.17533161
## age      0.34246363 -0.267915992 -0.40293466
## dis      0.04318413 -0.306609182 -0.20145106
## rad      2.99406694  0.977064038 -0.23407930
## tax      0.01038562 -0.092791398  0.72318439
## ptratio  0.13636292 -0.000223157 -0.26701032
## black   -0.11377215 -0.006155782  0.11474048
## lstat    0.21586097 -0.334448550  0.44485172
## medv     0.22266175 -0.432058844 -0.20421166
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9504 0.0348 0.0148
# Plot
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes, main="LDA")
lda.arrows(lda.fit, myscale = 2)

Figure 3. LDA

Figure 3 shows the LDA results: different colors represent different classes and the arrows show the impact of each predictor variable. Accessibility to radial highways seems to be strongly associated with high crime rates.

Classes can be predicted from the LDA model using the test data, and this information can be used to assess the performance of the model. The correct classes are cross-tabulated against the predicted ones.

# save test crime
correct_classes <- test$crime
## test set for prediction
test <- dplyr::select(test, -crime) 
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# and cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) 
##           predicted
## correct    low med_low med_high high
##   low       12       7        0    0
##   med_low    5      18        4    0
##   med_high   0       5       25    0
##   high       0       0        0   26

The model performs very well with the high class: all 26 predictions are correct. Low is predicted correctly 12 times and misclassified as medium low 7 times. Medium low is predicted correctly 18 times and incorrectly 9 times (5 as low, 4 as medium high). Medium high is predicted correctly 25 times and as medium low 5 times. The model thus performs well overall, with most of the confusion occurring between the adjacent low and medium low classes.
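The overall accuracy can be computed directly from the cross-tabulation by dividing the sum of the diagonal (correct predictions) by the total number of test observations. A minimal sketch using the counts printed above:

```r
# Sketch: overall accuracy from the confusion matrix printed above
tab <- matrix(c(12,  7,  0,  0,
                 5, 18,  4,  0,
                 0,  5, 25,  0,
                 0,  0,  0, 26),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  # 0.794: about 79% of test observations classified correctly
```

Note that the exact counts, and hence the accuracy, vary from run to run with the random train/test split.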

Distance and K-means clustering

K-means is a clustering method that groups observations based on their similarity. The number of clusters that best fits the data can be determined, for example, by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. The optimal number can be read from Figure 4 at the point where the WCSS drops sharply; here this happens at two clusters.

data("Boston")
#scale
boston_scaled <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results to find optimal number of clusters
plot(1:k_max, twcss, type='b')

Figure 4. K-means clustering

# Plot 
# k-means clustering
km <- kmeans(dist_eu, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Figure 5. Variables by two cluster

In Figure 5, the variables of the Boston data are plotted pairwise and colored by cluster. The difference between the two clusters is visible in almost all panels. There is a clear difference, for example, in crime rate and accessibility to radial highways, which was also seen in the LDA plot.
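The visual impression can be checked numerically by cross-tabulating the cluster assignments against the crime quartiles. A sketch (here `kmeans()` is applied to the scaled data matrix rather than the distance matrix, which is the more common usage; the seed is an arbitrary assumption):

```r
# Sketch: how the two k-means clusters line up with the crime quartiles.
# Assumes MASS::Boston; kmeans() is run on the scaled data matrix here,
# not the distance matrix, and set.seed(1) is an arbitrary choice.
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))
set.seed(1)
km <- kmeans(boston_scaled, centers = 2)

crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime, cluster = km$cluster)
```

If the clusters separate crime rate well, the high-crime quartile should fall almost entirely into one of the two clusters.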